This project was inspired by Kaggle’s data set on the relationship between mental health and music.

knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
#library(bslib)
# load the script with the source code
source('data_loading.R')

Data checks

datatable(head(health_survey))
dim(health_survey)
## [1] 736  34

The dataset has 736 records and 34 columns. Most columns correspond to a survey question about the participant's music listening habits or mental health.

#remove duplicate rows
health_survey <- health_survey %>%
  distinct()
#check to see if duplicates were removed correctly
dim(health_survey)
## [1] 736  34

After removing duplicates we see that the dimensions of the dataframe do not change, indicating that there were no duplicates to start with.
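A more direct check is to count the duplicated rows before dropping them. The following is a minimal sketch on a small made-up data frame; `toy_survey` and its values are hypothetical and are not taken from the survey.

```r
# Hypothetical toy data: rows 2 and 3 are identical
toy_survey <- data.frame(
  Age = c(18, 21, 21, 30),
  Anxiety = c(3, 7, 7, 5)
)

# duplicated() flags every row that repeats an earlier row
n_duplicates <- sum(duplicated(toy_survey))
n_duplicates

# unique() keeps the first copy of each row
toy_deduplicated <- unique(toy_survey)
nrow(toy_deduplicated)
```

Applied to the real data, `sum(duplicated(health_survey))` returning 0 would confirm that there are no duplicate rows, without relying on a comparison of dimensions.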

How was the data collected?

The data was collected between August and November 2022. The survey was circulated via a Google form on platforms like Discord, and there was no restriction in terms of geography.

# how many rows and columns are in the dataset
#dim(health_survey)
#datatable(summary(health_survey))
# how many missing values does each column have?
datatable(health_survey %>%
  summarise_all(~sum(is.na(.))))

Not all questions were compulsory for the participants to fill in, and this explains why 8 columns have at least one missing value.
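To list which columns are affected, rather than reading the counts off the table, the per-column totals can be filtered. This is a small sketch using base R on hypothetical data; `toy_survey` and its values are invented for illustration.

```r
# Hypothetical toy data with missing values in two of three columns
toy_survey <- data.frame(
  Age = c(18, NA, 21),
  BPM = c(NA, NA, 120),
  Anxiety = c(3, 7, 5)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy_survey))

# Keep only the names of columns with at least one missing value
cols_with_na <- names(na_counts[na_counts > 0])
cols_with_na
length(cols_with_na)
```

The same two steps on `health_survey` should return the names of the columns that have missing values.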

Privacy concerns

Looking at the columns, we see that the dataset does not contain any direct identifiers for the participants, so there is little risk of personal information being leaked from the survey.

#have a look at column names in the dataset and convert them to a list
list(names(health_survey))
## [[1]]
##  [1] "Timestamp"                    "Age"                         
##  [3] "Primary streaming service"    "Hours per day"               
##  [5] "While working"                "Instrumentalist"             
##  [7] "Composer"                     "Fav genre"                   
##  [9] "Exploratory"                  "Foreign languages"           
## [11] "BPM"                          "Frequency [Classical]"       
## [13] "Frequency [Country]"          "Frequency [EDM]"             
## [15] "Frequency [Folk]"             "Frequency [Gospel]"          
## [17] "Frequency [Hip hop]"          "Frequency [Jazz]"            
## [19] "Frequency [K pop]"            "Frequency [Latin]"           
## [21] "Frequency [Lofi]"             "Frequency [Metal]"           
## [23] "Frequency [Pop]"              "Frequency [R&B]"             
## [25] "Frequency [Rap]"              "Frequency [Rock]"            
## [27] "Frequency [Video game music]" "Anxiety"                     
## [29] "Depression"                   "Insomnia"                    
## [31] "OCD"                          "Music effects"               
## [33] "Permissions"                  "While working_numeric"

Because the participants do not have identifiers, it is difficult to determine whether a given person submitted the form once or multiple times.

#convert this to a pie chart
table(health_survey$Permissions)
## 
## I understand. 
##           736

In addition to there being no identifying information, participants also gave consent for their responses to be published.

Data analysis

Participant details

Even though the data are anonymised, we can still get some insight into who the participants are with the information that was collected.

ggplotly(age_distribution)

The age distribution is skewed towards younger people (between the ages of 16 and 32).

grid.arrange(instrumentalist + ylim(0, 600), composer)

Musical habits

#ggplotly(streaming, tooltip = 'count')
grid.arrange(streaming, hours_listening, nrow = 1)

ggplotly(fave_genre, tooltip = 'count')
#ggplotly(hours_listening)

Mental health

all_responses

Participants were asked to rate how often they experience anxiety, depression, OCD and insomnia on a scale of 0-10, where 0 means they do not experience the condition at all and 10 means they experience it regularly or to an extreme degree. The most prevalent mental health conditions in this dataset are depression and anxiety.

Mental health responses

What are the differences in mental health responses between those who listen to music regularly and those who don’t?

ggplotly(improved)

Stats

What is the relationship between listening to music and mental health?

We have a number of responses from the participants of the survey, but for now we want to see if there is an association between listening to music while working and mental health improvement. Both of these variables are categorical, so a chi-square test of independence is appropriate.

Null hypothesis:

There is no relationship between listening to music while working and mental health responses.

Alternative hypothesis:

A relationship exists between listening to music while working and whether mental health improves, stays the same or worsens as a result.

# perform a chi square test of independence using 2 columns: `While working` and `Music effects`

#select the desired columns
music_responses <- health_survey %>%
  select(`While working`, `Music effects`)

#create a contingency table
contingency_table <- table(music_responses$`While working`, music_responses$`Music effects`)

X_square_test <- chisq.test(contingency_table)

# show the results from the test
X_square_test
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 21.192, df = 2, p-value = 2.501e-05

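From the chi-square test we can see that the p-value is 2.501e-05. This is below the 5% level of significance, so we reject the null hypothesis and conclude that there is a statistically significant association between listening to music while working and the reported effect of music on mental health. The test shows an association only; it does not show that listening to music causes the effect.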
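To make the statistic and the degrees of freedom less of a black box, the sketch below recomputes them by hand for a small 2 x 3 table and compares the result with `chisq.test()`. The counts in `toy_table` are hypothetical and are not the survey's contingency table; only the shape (two rows, three columns, so df = 2) mirrors the test above.

```r
# Hypothetical counts, not the survey data
toy_table <- matrix(
  c(30, 10, 20,
    10, 20, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    while_working = c("Yes", "No"),
    music_effects = c("Improve", "No effect", "Worsen")
  )
)

toy_test <- chisq.test(toy_table)

# Expected counts under independence: row total * column total / grand total
expected_by_hand <- outer(rowSums(toy_table), colSums(toy_table)) / sum(toy_table)

# The chi-square statistic is the sum of (observed - expected)^2 / expected
statistic_by_hand <- sum((toy_table - expected_by_hand)^2 / expected_by_hand)

# Degrees of freedom: (rows - 1) * (columns - 1)
df_by_hand <- (nrow(toy_table) - 1) * (ncol(toy_table) - 1)

# Compare the hand calculation with the value reported by chisq.test()
all.equal(unname(toy_test$statistic), statistic_by_hand)
df_by_hand
```

The expected counts are also stored in the test object (`toy_test$expected`), which is a convenient way to check that no cell is too small for the chi-square approximation.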

Does the preferred genre have an effect on mental health outcomes?

#select the desired columns
music_genres <- health_survey %>%
  select(`Fav genre`, `Music effects`)

#create a contingency table
contingency_table_genres <- table(music_genres$`Fav genre`, music_genres$`Music effects`)

X_square_test_genres <- chisq.test(contingency_table_genres)

# show the results from the test
X_square_test_genres
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_genres
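## X-squared = 36.381, df = 30, p-value = 0.1959

The p-value of 0.1959 is above the 5% level of significance, so we fail to reject the null hypothesis: there is no evidence in this dataset of an association between favourite genre and the reported effect of music on mental health.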
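With df = 30 the genre table has many cells, and some genres may have few respondents (this is an assumption, since the cell counts are not shown here). When expected counts are small, the chi-square approximation can be unreliable. One option is a Monte Carlo p-value via the `simulate.p.value` argument of `chisq.test()`. The sketch below illustrates the idea on a hypothetical sparse table; `toy_sparse` and its counts are invented.

```r
set.seed(42)  # make the simulated p-value reproducible

# Hypothetical sparse 3 x 3 table, not the survey data
toy_sparse <- matrix(
  c(12, 3, 1,
     8, 2, 0,
     2, 1, 1),
  nrow = 3, byrow = TRUE
)

# Inspect the expected counts; warnings are suppressed because
# the small counts are exactly what we are checking for
expected <- suppressWarnings(chisq.test(toy_sparse))$expected
any(expected < 5)

# Monte Carlo p-value based on 2000 simulated tables
sim_test <- chisq.test(toy_sparse, simulate.p.value = TRUE, B = 2000)
sim_test$p.value
```

If `any(contingency_table_genres_expected < 5)` turned out to be true for the real table (where `contingency_table_genres_expected` is a hypothetical name for `X_square_test_genres$expected`), rerunning the test with `simulate.p.value = TRUE` would be a reasonable robustness check.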

Machine learning